Document summary

  • This RMarkdown document serves as a reproducible code for generating the Biotechnology per-cell images.
  • This can be fully re-run assuming:
    • Enough disk capacity to download full Biotechnology data bundle,
    • Enough memory to read ome.tif image via RBioFormats package,
    • Similar directory structure as described below, and
    • Mac/linux operating system.
  • The output is the zipped file Biotechnology.zip, which has already been provided on Canvas.

Key considerations

  • Only a randomly selected subset of 1,000 cells’ images are extracted in this document.
  • Cell labels are the result of Graph-based clustering (an unsupervised learning technique) of gene expression data extracted for these cells. Can this be thought of as “ground truth”?


1 Full data description

The data described in this document stems from a new biotechnology: molecule-resolved spatial genomics. In particular we will explore data that has been generated by 10x Genomics Xenium instrument on a fresh frozen mouse brain coronal section - Tiny subset. The technology results in several outputs including:

  • cell morphology image where intensity corresponds to presence of the nucleus of each cell.
  • cell boundaries indicating spatial locations of detected cells
  • RNA abundances (gene expression) of each cell, which has been grouped into 28 distinct clusters, of which cluster labels are provided.

The full data and description can be found in this link.

2 Data and code directory structure

For this reproducible code we assume that this RMarkdown document is saved within the following directory structure:

  • Biotechnology/
    • data_processed/
      • clusters.csv
      • cell_boundaries.csv.gz
      • morphology_focus.tif
    • data_raw/
      • Xenium_V1_FF_Mouse_Brain_Coronal_Subset_CTX_HP_outs.zip
      • <unzipped files>
    • scripts/
      • DATA3888_Biotechnology_generateImages_2024.Rmd (this document)

For you to be able to fully re-run this code you will need to download the contents of data_raw/ separately (see next section).

You are provided this directory structure, with the contents of data_raw/ removed due to the large file size.

3 Raw data bundle

The contents to Xenium_V1_FF_Mouse_Brain_Coronal_Subset_CTX_HP_outs folder is from the Xenium_V1_FF_Mouse_Brain_Coronal_Subset_CTX_HP_outs.zip file (approx 3.5GB), available to download via this LINK, or can be programmatically downloaded using wget into the target directory.

wget https://cf.10xgenomics.com/samples/xenium/1.0.2/Xenium_V1_FF_Mouse_Brain_Coronal_Subset_CTX_HP/Xenium_V1_FF_Mouse_Brain_Coronal_Subset_CTX_HP_outs.zip ../data_raw/
unzip ../data_raw/Xenium_V1_FF_Mouse_Brain_Coronal_Subset_CTX_HP_outs.zip -d ../data_raw/

Note! It is very important to ensure you are working from the correct working directory, i.e. within the scripts folder in the directory structure described above.

4 The EBImage package

EBImage is an R package that is available in the Bioconductor Project. Bioconductor is similar to the Comprehensive R Archive Network (CRAN), in that you can install packages from this repository.

EBImage provides general purpose functionality for image processing and analysis. In the context of (high-throughput) microscopy-based cellular assays, EBImage offers tools to segment cells and extract quantitative cellular descriptors. This allows the automation of such tasks using the R programming language and facilitates the use of other tools in the R environment for signal processing, statistical modeling, machine learning and visualization with image data.

This chapter in Modern Statistics for Modern Biology is a great reference for using EBImage for different types of imaging data.

To install the EBImage package, you can run the chunk below. This will check whether you have the BiocManager package installed, and if not it will install BiocManager. Then, the EBImage package will be installed via the BiocManager::install() function.

Note: if you attempt to run install.packages("EBImage") you may be met with an error! This is because the package is available in Bioconductor and not on CRAN.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("EBImage")

5 Convert cell morphology .ome.tif to .tif format

The raw data bundle contains the cell morphology image in the ome.tif file format. This type of file includes the image pixel intensities as well as additional metadata that is associated with the microscopy experiment. Since we are interested in the image information only, we need to convert to a .tif format to enable further downstream processing with EBImage.

You are given the .tif file in the Processed Data Bundle, but you can see how this was generated in the rest of this section.

Load the EBImage R package.

library(EBImage)

Read in morphology focus .ome.tif image and export out as a .tif into the ../data/ folder. Only do so if the target .tif file does not exist.

Note that if we need to generate the .tif file, we need to first set up the java memory to 10GB and load the RBioFormats package, which is available on Bioconductor development branch and on Github.

tifFile = "../data_processed/morphology_focus.tif"
if (!file.exists(tifFile)) {
    options(java.parameters = "-Xmx10g")
    library(RBioFormats)
    checkJavaMemory()
    img_ome = RBioFormats::read.image("../data_raw/morphology_focus.ome.tif", read.metadata = FALSE,
        normalize = TRUE)

    img = img_ome[[1]]@.Data
    EBImage::writeImage(x = img, files = tifFile, type = "tiff")
}

6 Copy tabular cell data to ../data_processed/ directory

Since the raw data bundle contains many large files, for convenience we have copied two files from the ../data_raw/ directory to the ../data_processed/ directory. This can be done programmatically using the system() function.

system("cp ../data_raw/cell_boundaries.csv.gz ../data_processed/cell_boundaries.csv.gz")
system("cp ../data_raw/analysis/clustering/gene_expression_graphclust/clusters.csv ../data_processed/clusters.csv")

7 Read cell morphology image data

Read and display the morphology image. Display requires some scaling of the intensities according to the distribution of the intensities, to the 99th percentile.

img = EBImage::readImage(tifFile)
EBImage::display(img/quantile(img, 0.99))

8 Read cell segmentation data

Cell segmentation is provided in the data bundle as a .csv file containing the vertices around each cells’ boundary. The coordinates of the boundaries need to be converted between micrometres (um) and pixels. This scaling factor can be found in the ../data_raw/experiment.xenium file under “pixel_size”.

Note that read.csv can read a Gzip compressed file.

cell_boundaries = read.csv("../data_processed/cell_boundaries.csv.gz", header = TRUE)
cell_boundaries$vertex_x_trans = cell_boundaries$vertex_x/0.2125
cell_boundaries$vertex_y_trans = cell_boundaries$vertex_y/0.2125
head(cell_boundaries)
##   cell_id vertex_x vertex_y vertex_x_trans vertex_y_trans
## 1       1 1901.875 2526.413           8950          11889
## 2       1 1901.450 2537.038           8948          11939
## 3       1 1900.175 2539.375           8942          11950
## 4       1 1896.562 2539.800           8925          11952
## 5       1 1885.938 2537.887           8875          11943
## 6       1 1882.963 2542.775           8861          11966

9 Read gene expression-based cluster labels for all cells

The imaging data also contains gene expression information, that has been used to perform graph-based clustering. We read this data in via read.csv.

clusters = read.csv("../data_processed/clusters.csv")
head(clusters)
##   Barcode Cluster
## 1       1       6
## 2       2       6
## 3       3       2
## 4       4       4
## 5       5       6
## 6       6       4
ncells = nrow(clusters)
ncells
## [1] 36553

10 Generate per-cell images for 1,000 randomly selected cells

In this code chunk, we extract morphology images for 1,000 random cells. For each cell, we subset the morphology image to the rectangle of pixels that cover the cell segmentation boundary.

To get a sense of the variety of cell morphologies, we visualise the first five randomly selected cells.

set.seed(2024)

ncells_subset = 1000

cells_subset = sample(ncells, ncells_subset)
table(clusters[cells_subset, "Cluster"], useNA = "always")
## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
##   75   89   91   64   62   65   53   40   26   37   45   33   34   41   23   35 
##   17   18   19   20   21   22   23   24   25   26   27   28 <NA> 
##   19   17   18   24   17   21   16   11   12   12   11    9    0
for (i in cells_subset) {

    # extract the boundary vertices for the selected cell
    bounds_i = subset(cell_boundaries, cell_id == i)

    # extract the cluster value for the selected cell
    clustval_i = clusters[i, "Cluster"]

    # extract the pixel intensities for the area covering the cell boundary
    img_sub = img[min(bounds_i$vertex_x_trans):max(bounds_i$vertex_x_trans), min(bounds_i$vertex_y_trans):max(bounds_i$vertex_y_trans)]

    # normalise the pixel intensities according to 99th percentile
    img_sub_norm = img_sub/quantile(img_sub, 0.99)

    # as an example, display the image for the first selected cell
    if (i %in% cells_subset[1:5]) {
        print(paste0("displaying image for cell ", i))
        EBImage::display(img_sub/quantile(img_sub, 0.99))
    }

    # if there is no folder for cell_images, create one
    if (!file.exists("../data_processed/cell_images/")) {
        system("mkdir ../data_processed/cell_images/")
    }

    # if there is no folder for the cluster, create one
    clustval_i_directory = paste0("../data_processed/cell_images/cluster_", clustval_i)
    if (!file.exists(clustval_i_directory)) {
        system(paste0("mkdir ", clustval_i_directory))
    }

    # save the extracted image as a png file
    EBImage::writeImage(x = img_sub_norm, files = paste0(clustval_i_directory, "/cell_",
        i, ".png"), type = "png")

}
## [1] "displaying image for cell 21029"

## [1] "displaying image for cell 19872"

## [1] "displaying image for cell 7802"

## [1] "displaying image for cell 33803"

## [1] "displaying image for cell 19362"

11 Create cell_images.zip processed data file

The contents of the data_processed/cell_images folder can then zipped into a file to be shared separately, with the following commands in the terminal. The first command changes the working directory to ../data_processed/ and the next command creates the cell_images.zip file, containing all the contents of the cell_images/ folder.

cd ../data_processed/
zip -r cell_images.zip cell_images/*

12 Finish

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.5.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Berlin
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] EBImage_4.43.0
## 
## loaded via a namespace (and not attached):
##  [1] cli_3.6.1           knitr_1.43          rlang_1.1.1        
##  [4] xfun_0.39           highr_0.10          tiff_0.1-11        
##  [7] png_0.1-8           jsonlite_1.8.5      RCurl_1.98-1.12    
## [10] htmltools_0.5.5     formatR_1.14        sass_0.4.6         
## [13] locfit_1.5-9.8      rmarkdown_2.22      grid_4.3.1         
## [16] evaluate_0.21       jquerylib_0.1.4     abind_1.4-5        
## [19] bitops_1.0-7        fastmap_1.1.1       yaml_2.3.7         
## [22] compiler_4.3.1      htmlwidgets_1.6.2   fftwtools_0.9-11   
## [25] rstudioapi_0.14     lattice_0.21-8      digest_0.6.31      
## [28] R6_2.5.1            bslib_0.5.0         tools_4.3.1        
## [31] jpeg_0.1-10         BiocGenerics_0.47.0 cachem_1.0.8
knitr::knit_exit()